{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dense-X-Retrieval Pack\n",
    "\n",
    "This notebook walks through using the `DenseXRetrievalPack`, which parses documents into nodes, and then generates propositions from each node to assist with retreival.\n",
    "\n",
    "This follows the idea from the paper [Dense X Retrieval: What Retreival Granularity Should We Use?](https://arxiv.org/abs/2312.06648).\n",
    "\n",
    "From the paper, a proposition is described as:\n",
    "\n",
    "```\n",
    "Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format.\n",
    "```\n",
    "\n",
    "We use the provided OpenAI prompt from their paper to generate propositions, which are then embedded and used to retrieve their parent node chunks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n",
    "\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100 13.0M  100 13.0M    0     0  1160k      0  0:00:11  0:00:11 --:--:-- 1574k\n"
     ]
    }
   ],
   "source": [
    "!mkdir -p 'data/'\n",
    "!curl 'https://arxiv.org/pdf/2307.09288.pdf' -o 'data/llama2.pdf'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to /home/loganm/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n",
      "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
      "[nltk_data]     /home/loganm/nltk_data...\n",
      "[nltk_data]   Package averaged_perceptron_tagger is already up-to-\n",
      "[nltk_data]       date!\n",
      "WARNING: CPU random generator seem to be failing, disabling hardware random number generation\n",
      "WARNING: RDRND generated: 0xffffffff 0xffffffff 0xffffffff 0xffffffff\n",
      "/home/loganm/.cache/pypoetry/virtualenvs/llama-hub-86aDfznI-py3.11/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    }
   ],
   "source": [
    "from llama_hub.file.unstructured import UnstructuredReader\n",
    "\n",
    "documents = UnstructuredReader().load_data(\"data/llama2.pdf\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run the DenseXRetrievalPack\n",
    "\n",
    "The `DenseXRetrievalPack` creates both a retriver and query engine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.llama_pack import download_llama_pack\n",
    "\n",
    "DenseXRetrievalPack = download_llama_pack(\"DenseXRetrievalPack\", \"./dense_pack\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 90/90 [02:02<00:00,  1.36s/it]\n",
      "Generating embeddings: 100%|██████████| 2210/2210 [00:13<00:00, 159.12it/s]\n"
     ]
    }
   ],
   "source": [
    "from llama_index.llms import OpenAI\n",
    "from llama_index.text_splitter import SentenceSplitter\n",
    "\n",
    "dense_pack = DenseXRetrievalPack(\n",
    "    documents,\n",
    "    proposition_llm=OpenAI(model=\"gpt-3.5-turbo\", max_tokens=750),\n",
    "    query_llm=OpenAI(model=\"gpt-3.5-turbo\", max_tokens=256),\n",
    "    text_splitter=SentenceSplitter(chunk_size=1024),\n",
    ")\n",
    "dense_query_engine = dense_pack.query_engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import VectorStoreIndex\n",
    "\n",
    "base_index = VectorStoreIndex.from_documents(documents)\n",
    "base_query_engine = base_index.as_query_engine()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test Queries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How was Llama2 pretrained?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Llama 2 was pretrained using an optimized auto-regressive transformer. Several changes were made to improve performance, including more robust data cleaning, updated data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import Markdown, display\n",
    "\n",
    "response = dense_query_engine.query(\"How was Llama2 pretrained?\")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Llama 2 was pretrained using an optimized auto-regressive transformer. The pretraining approach involved robust data cleaning, updated data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models. The training corpus included a new mix of data from publicly available sources, excluding data from Meta's products or services. The pretraining methodology and training details are described in more detail in the provided context."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = base_query_engine.query(\"How was Llama2 pretrained?\")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What baselines are used to compare performance and accuracy?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "The baselines used to compare performance and accuracy are Llama 1, Falcon, and MPT. These models are compared with Llama 2 to evaluate its advancements in terms of safety and other important aspects."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = dense_query_engine.query(\n",
    "    \"What baselines are used to compare performance and accuracy?\"\n",
    ")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "The baselines used to compare performance and accuracy are the MPT and Falcon models. These models are compared against the evaluation framework and any publicly reported results to determine the best score."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = base_query_engine.query(\n",
    "    \"What baselines are used to compare performance and accuracy?\"\n",
    ")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What datasets were used for measuring performance and accuracy?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "The datasets used for measuring performance and accuracy include HumanEval, MBPP, PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, CommonsenseQA, NaturalQuestions, TriviaQA, SQuAD, GSM8K, and MATH."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = dense_query_engine.query(\n",
    "    \"What datasets were used for measuring performance and accuracy?\"\n",
    ")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "The datasets used for measuring performance and accuracy include HumanEval, MBPP, PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, CommonsenseQA, NaturalQuestions, TriviaQA, SQuAD, QuAC, BoolQ, GSM8K, and MATH."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = base_query_engine.query(\n",
    "    \"What datasets were used for measuring performance and accuracy?\"\n",
    ")\n",
    "display(Markdown(str(response)))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama-index-4a-wkI5X-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
